104 research outputs found

    A conditional compression distance that unveils insights of the genomic evolution

    We describe a compression-based distance for genomic sequences. Instead of using the usual conjoint information content, as in the classical Normalized Compression Distance (NCD), it uses the conditional information content. To compute this Normalized Conditional Compression Distance (NCCD), we need a normal conditional compressor, which we built using a mixture of static and dynamic finite-context models. Using this approach, we measured chromosomal distances between Hominidae primates and also between Muroidea (rat and mouse), observing several evolutionary insights that so far have not been reported in the literature. Comment: Full version of DCC 2014 paper "A conditional compression distance that unveils insights of the genomic evolution".
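
    The difference between the conjoint and the conditional formulation can be sketched as follows. This is only an illustration under stated assumptions: zlib stands in for the authors' finite-context-model compressor, the conditional cost C(x|y) is approximated as C(yx) - C(y), and the NCCD is written in the max/max form of the normalized information distance, which is an assumption rather than a formula quoted from the abstract. All function names are hypothetical.

```python
import random
import zlib


def c(data: bytes) -> int:
    """Compressed size in bytes (zlib is an assumed stand-in compressor)."""
    return len(zlib.compress(data, 9))


def ncd(x: bytes, y: bytes) -> float:
    """Classical NCD, based on the conjoint cost C(xy)."""
    cx, cy, cxy = c(x), c(y), c(x + y)
    return (cxy - min(cx, cy)) / max(cx, cy)


def cond(x: bytes, y: bytes) -> int:
    """Assumed approximation of the conditional cost C(x|y) as C(yx) - C(y)."""
    return max(c(y + x) - c(y), 0)


def nccd(x: bytes, y: bytes) -> float:
    """Conditional variant: max of the two conditional costs, normalized."""
    return max(cond(x, y), cond(y, x)) / max(c(x), c(y))


# Synthetic data: x is random DNA, y is x with a few point mutations,
# z is an unrelated random sequence.
rng = random.Random(0)
x = bytes(rng.choice(b"ACGT") for _ in range(5000))
mutated = bytearray(x)
for i in range(0, len(mutated), 250):
    mutated[i] = b"ACGT"[(b"ACGT".index(mutated[i]) + 1) % 4]
y = bytes(mutated)
z = bytes(rng.choice(b"ACGT") for _ in range(5000))

print("NCD :", ncd(x, y), ncd(x, z))
print("NCCD:", nccd(x, y), nccd(x, z))
```

    Under both measures, the mutated copy should come out much closer to the original than the unrelated sequence does. A general-purpose compressor with a limited window, such as zlib, is only suitable for short sequences like these.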

    Information profiles for DNA pattern discovery

    Finite-context modeling is a powerful tool for compressing and hence for representing DNA sequences. We describe an algorithm to detect genomic regularities, within a blind discovery strategy. The algorithm uses information profiles built using suitable combinations of finite-context models. We used the genome of the fission yeast Schizosaccharomyces pombe strain 972 h- for illustration, unveiling locations of low information content, which are usually associated with DNA regions of potential biological interest. Comment: Full version of DCC 2014 paper "Information profiles for DNA pattern discovery".
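
    As a rough illustration of what an information profile is, the sketch below uses a single adaptive order-k finite-context model and records the number of bits each symbol costs. It is not the algorithm from the paper, which combines several models; the function name, the order k, and the smoothing parameter alpha are assumptions chosen for the example.

```python
import math
from collections import defaultdict


def information_profile(seq: str, k: int = 3, alpha: float = 1.0,
                        alphabet: str = "ACGT") -> list:
    """Per-symbol information content (bits) under one order-k model.

    Counts are updated after each symbol is scored, so the model is
    adaptive: repeated patterns become progressively cheaper.
    """
    counts = defaultdict(lambda: defaultdict(int))
    profile = []
    for i, sym in enumerate(seq):
        ctx = seq[max(0, i - k):i]          # the k preceding symbols
        ctx_counts = counts[ctx]
        total = sum(ctx_counts.values())
        # Additive smoothing keeps unseen symbols at non-zero probability.
        p = (ctx_counts[sym] + alpha) / (total + alpha * len(alphabet))
        profile.append(-math.log2(p))
        ctx_counts[sym] += 1
    return profile


# A highly repetitive sequence: the cost per symbol drops as the model learns.
profile = information_profile("ACGT" * 100)
print(profile[0], profile[-1])
```

    Regions where the profile stays low are the low-information locations the abstract refers to. In practice the raw profile would usually be smoothed before such regions are picked out.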

    Histogram packing, total variation, and lossless image compression

    Publication in the conference proceedings of EUSIPCO, Toulouse, France, 200

    Compression of Microarray Images


    Smash++: an alignment-free and memory-efficient tool to find genomic rearrangements

    Background: The development of high-throughput sequencing technologies and the resulting production of huge volumes of genomic data have accelerated biological and medical research and discovery. The study of genomic rearrangements is crucial owing to their role in chromosomal evolution, genetic disorders, and cancer. Results: We present Smash++, an alignment-free and memory-efficient tool to find and visualize small- and large-scale genomic rearrangements between 2 DNA sequences. This computational solution extracts information contents of the 2 sequences, exploiting a data compression technique to find rearrangements. We also present Smash++ visualizer, a tool that allows the visualization of the detected rearrangements along with their self- and relative complexity, by generating an SVG (Scalable Vector Graphics) image. Conclusions: Tested on several synthetic and real DNA sequences from bacteria, fungi, Aves, and Mammalia, the proposed tool was able to accurately find genomic rearrangements. The detected regions were in accordance with previous studies, which took alignment-based approaches or performed FISH (fluorescence in situ hybridization) analysis. The maximum peak memory usage among all experiments was approximately 1 GB, which makes Smash++ feasible to run on present-day standard computers. Peer reviewed

    Competitive Segmentation Performance on Near-lossless and Lossy Compressed Remote Sensing Images

    Image segmentation lies at the heart of multiple image processing chains, and achieving accurate segmentation is of utmost importance as it impacts later processing. Image segmentation has recently gained interest in the field of remote sensing, mostly due to the widespread availability of remote sensing data. This increased availability poses the problem of transmitting and storing large volumes of data. Compression is a common strategy to alleviate this problem. However, lossy or near-lossless compression prevents a perfect reconstruction of the recovered data. This letter investigates the image segmentation performance in data reconstructed after a near-lossless or a lossy compression. Two image segmentation algorithms and two compression standards are evaluated on data from several instruments. Experimental results reveal that segmentation performance over previously near-lossless and lossy compressed images is not markedly reduced at low and moderate compression ratios. In some scenarios, accurate segmentation performance can be achieved even for high compression ratios.

    Lossy-to-Lossless Compression of Biomedical Images Based on Image Decomposition

    The use of medical imaging has increased in recent years, especially with magnetic resonance imaging (MRI) and computed tomography (CT). Microarray imaging and images that can be extracted from RNA interference (RNAi) experiments also play an important role for large-scale gene sequence and gene expression analysis, allowing the study of gene function, regulation, and interaction across a large number of genes and even across an entire genome. These types of medical image modalities produce huge amounts of data that, for several reasons, need to be stored or transmitted at the highest possible fidelity between various hospitals, medical organizations, or research units.

    A Reference-Free Lossless Compression Algorithm for DNA Sequences Using a Competitive Prediction of Two Classes of Weighted Models

    The development of efficient data compressors for DNA sequences is crucial not only for reducing the storage and the bandwidth for transmission, but also for analysis purposes. In particular, the development of improved compression models directly influences the outcome of anthropological and biomedical compression-based methods. In this paper, we describe a new lossless compressor with improved compression capabilities for DNA sequences representing different domains and kingdoms. The reference-free method uses a competitive prediction model to estimate, for each symbol, the best class of models to be used before applying arithmetic encoding. There are two classes of models: weighted context models (including substitutional tolerant context models) and weighted stochastic repeat models. Both classes of models use specific sub-programs to handle inverted repeats efficiently. The results show that the proposed method attains a higher compression ratio than state-of-the-art approaches, on a balanced and diverse benchmark, using a competitive level of computational resources. An efficient implementation of the method is publicly available, under the GPLv3 license. Peer reviewed

    Dissimilar Symmetric Word Pairs in the Human Genome

    In this work we explore the dissimilarity between symmetric word pairs, by comparing the inter-word distance distribution of a word to that of its reversed complement. We propose a new measure of dissimilarity between such distributions. Since symmetric pairs with different patterns could point to evolutionary features, we search for the pairs with the most dissimilar behaviour. We focus our study on the complete human genome and its repeat-masked version. Comment: Submitted 13-Feb-2017; accepted, after a minor revision, 17-Mar-2017; 11th International Conference on Practical Applications of Computational Biology & Bioinformatics, PACBB 2017, Porto, Portugal, 21-23 June, 2017
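
    The comparison described above can be sketched as follows: collect the distances between consecutive occurrences of a word, do the same for its reversed complement, and compare the two empirical distributions. The abstract does not specify the proposed dissimilarity measure, so total variation distance is used here purely as a stand-in; all function names are hypothetical.

```python
from collections import Counter

COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(word: str) -> str:
    """Reversed complement of a DNA word, e.g. AACG -> CGTT."""
    return word.translate(COMPLEMENT)[::-1]


def occurrences(seq: str, word: str) -> list:
    """Start positions of all (possibly overlapping) occurrences of word."""
    positions = []
    i = seq.find(word)
    while i != -1:
        positions.append(i)
        i = seq.find(word, i + 1)
    return positions


def distance_distribution(seq: str, word: str) -> dict:
    """Relative frequencies of distances between consecutive occurrences."""
    pos = occurrences(seq, word)
    hist = Counter(b - a for a, b in zip(pos, pos[1:]))
    n = sum(hist.values())
    return {d: v / n for d, v in hist.items()} if n else {}


def total_variation(p: dict, q: dict) -> float:
    """Stand-in dissimilarity; not the measure proposed in the paper."""
    keys = set(p) | set(q)
    return 0.5 * sum(abs(p.get(k, 0.0) - q.get(k, 0.0)) for k in keys)


seq = "AACGGAACGGAACTTGTTCCGTTAGTT"
word = "AAC"
p = distance_distribution(seq, word)
q = distance_distribution(seq, reverse_complement(word))
print(p, q, total_variation(p, q))
```

    Ranking all words of a given length by this score would give a list of the most dissimilar symmetric pairs under the stand-in measure.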